ABSTRACT
Data-intensive applications are becoming increasingly popular. However, only a few of them, with sufficiently high volume, can afford dedicated hardware acceleration (such as a neural network processor, or NPU) or a platform-specific software implementation (such as TensorFlow running on GPUs). In this paper, we propose a hardware- and software-transparent framework for accelerating general-purpose data-intensive applications. Our framework is based on a key insight: most data-intensive applications spend the vast majority of their execution time in a few inner loops with abundant opportunities for data-level parallelism (DLP). In particular, we propose SALAD, a static analyzer for loop acceleration that exploits DLP in hot loops, built on the LLVM compiler infrastructure. In contrast to traditional DLP exploitation techniques, SALAD is transparent to both software and architecture: it requires no changes to source or binary code and no vectorized instruction set architecture (ISA) extensions. Instead, it works directly on the program binary and generates a profile of the DLP opportunities it contains. This profile is fed transparently to the hardware accelerator to speed up execution. Based on our experimental results, we estimate that the DLP information provided by SALAD could yield 3.6x-60.2x speedups on a set of benchmarks, depending on their inherent DLP.